19240: fix: #10020 make concat function support array concatenation#113
19240: fix: #10020 make concat function support array concatenation#113martin-augment wants to merge 1 commit intomainfrom
Conversation
WalkthroughThe concat function in datafusion/functions/src/string/concat.rs has been modified to support array-aware concatenation. The signature initialization changed from ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| use arrow::datatypes::DataType::*; | ||
| args.iter().any(|arg| match arg { | ||
| ColumnarValue::Array(arr) => matches!(arr.data_type(), List(_)), | ||
| ColumnarValue::Scalar(scalar) => matches!(scalar, ScalarValue::List(_, _)), |
There was a problem hiding this comment.
This pattern won’t match: ScalarValue::List takes a single Arc<ListArray> (and there are separate LargeList/FixedSizeList variants). As written, scalar list arguments won’t be detected.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the ScalarValue::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.
| fn has_array_args(args: &[ColumnarValue]) -> bool { | ||
| use arrow::datatypes::DataType::*; | ||
| args.iter().any(|arg| match arg { | ||
| ColumnarValue::Array(arr) => matches!(arr.data_type(), List(_)), |
There was a problem hiding this comment.
Only DataType::List(_) is checked; consider including LargeList(_) and FixedSizeList(_) so all list-like arrays are recognized.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.
| let args = args.into_iter() | ||
| .map(|arg| { | ||
| // If the argument is not already an array, wrap it in an array | ||
| if !matches!(arg.get_type(&DataType::Null).unwrap_or_default(), DataType::List(_)) { |
There was a problem hiding this comment.
Using expr.get_type without a schema (&DataType::Null) won’t determine if arg is a list and risks wrapping list args in MakeArray, producing nested lists. You likely need schema‑aware type resolution or a different check here.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed code does not even compile. The expected parameter is schema: &dyn ExprSchema, not &DataType.
| use DataType::*; | ||
|
|
||
| // If any argument is an array, return the array type | ||
| if arg_types.iter().any(|t| matches!(t, List(_))) { |
There was a problem hiding this comment.
Array detection in return_type only checks List(_); it should likely also handle LargeList(_) and FixedSizeList(_) to return the correct array type. (Also applies to the if let Some(DataType::List(...)) below.)
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.
Code Review for PR #113: fix: apache#10020 make concat function support array concatenationThank you for working on this feature! I've reviewed the changes and identified several critical issues that need to be addressed before this can be merged. 🔴 Critical Issues1. Compilation Error - Missing closing brace (Line 72)Self {
signature: Signature::variadic_any(Volatility::Immutable)
), // ❌ Missing opening brace, extra closing paren
}Should be: Self {
signature: Signature::variadic_any(Volatility::Immutable),
}2. Missing closing brace for impl block (Line 107)The 3. Non-existent Methods Used (Lines 151-153)let exprs: Vec<Expr> = args.args.into_iter().map(|a| a.into_expr()).collect();
let expr = Self::to_array_concat(exprs)?;
return expr.eval(args.context);Problems:
4. Incorrect Use of
|
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/functions/src/string/concat.rs (1)
67-74: Syntax error: Mismatched parenthesis.Line 72 contains a stray
)that doesn't match any opening parenthesis. This will cause a compilation error.pub fn new() -> Self { use DataType::*; Self { - signature: Signature::variadic_any(Volatility::Immutable) - ), + signature: Signature::variadic_any(Volatility::Immutable), } }
🧹 Nitpick comments (1)
datafusion/functions/src/string/concat.rs (1)
18-35: Unused imports detected.Several imports appear to be unused in this file:
TimeUnit(line 19) - not referenced anywhereexec_err(line 30) - not referenced anywhereScalarFunctionExpr(line 31) - not referenced anywhereConsider removing these to keep the imports clean.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
datafusion/functions/src/string/concat.rs(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.
Applied to files:
datafusion/functions/src/string/concat.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
| // Check if any argument is an array type | ||
| fn has_array_args(args: &[ColumnarValue]) -> bool { | ||
| use arrow::datatypes::DataType::*; | ||
| args.iter().any(|arg| match arg { | ||
| ColumnarValue::Array(arr) => matches!(arr.data_type(), List(_)), | ||
| ColumnarValue::Scalar(scalar) => matches!(scalar, ScalarValue::List(_, _)), | ||
| }) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the ScalarValue enum definition
rg -n "enum ScalarValue" --type rust -A 50 | head -100Repository: martin-augment/datafusion
Length of output: 3784
🏁 Script executed:
#!/bin/bash
# Search for ScalarValue::List pattern usage in the codebase
rg "ScalarValue::List\(" --type rust -B2 -A2 | head -50Repository: martin-augment/datafusion
Length of output: 3735
🏁 Script executed:
#!/bin/bash
# Look specifically in arrow crate or where ScalarValue is defined
fd -e rs | xargs rg -l "enum ScalarValue" | head -5Repository: martin-augment/datafusion
Length of output: 104
🏁 Script executed:
#!/bin/bash
# Get the full ScalarValue enum definition including List variant
sed -n '324,450p' datafusion/common/src/scalar/mod.rsRepository: martin-augment/datafusion
Length of output: 5239
Fix incorrect pattern matching for ScalarValue::List.
The pattern ScalarValue::List(_, _) on line 81 is incorrect. The List variant takes a single Arc<ListArray> parameter, not two. Change it to ScalarValue::List(_).
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 76 to 83, the pattern
match for ScalarValue is incorrect: it uses ScalarValue::List(_, _) but the List
variant takes a single argument; change the match arm to ScalarValue::List(_) so
it matches the single Arc<ListArray> parameter (leave the Array case as-is).
| // Convert arguments to array_concat function | ||
| fn to_array_concat(args: Vec<Expr>) -> Result<Expr> { | ||
| let array_concat = BuiltinScalarFunction::ArrayConcat; | ||
| let args = args.into_iter() | ||
| .map(|arg| { | ||
| // If the argument is not already an array, wrap it in an array | ||
| if !matches!(arg.get_type(&DataType::Null).unwrap_or_default(), DataType::List(_)) { | ||
| Ok(Expr::ScalarFunction(ScalarFunction::new( | ||
| BuiltinScalarFunction::MakeArray, | ||
| vec![arg], | ||
| ))) | ||
| } else { | ||
| Ok(arg) | ||
| } | ||
| }) | ||
| .collect::<Result<Vec<_>>>()?; | ||
|
|
||
| Ok(Expr::ScalarFunction(ScalarFunction::new( | ||
| array_concat, | ||
| args, | ||
| ))) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the actual file and understand the context
cd datafusion/functions/src/string
wc -l concat.rsRepository: martin-augment/datafusion
Length of output: 82
🏁 Script executed:
# Read the specific lines mentioned in the review
sed -n '85,106p' concat.rsRepository: martin-augment/datafusion
Length of output: 121
🏁 Script executed:
# Check if BuiltinScalarFunction exists and how it's used
rg -n "BuiltinScalarFunction" --type rust -C 2 datafusion/ | head -150Repository: martin-augment/datafusion
Length of output: 1597
🏁 Script executed:
# Check Expr::get_type signature
rg -n "fn get_type" --type rust -A 3 datafusion/ | head -100Repository: martin-augment/datafusion
Length of output: 2945
🏁 Script executed:
# Check ScalarFunction::new signature and usage
rg -n "ScalarFunction::new" --type rust -C 2 datafusion/ | head -100Repository: martin-augment/datafusion
Length of output: 8058
🏁 Script executed:
# Check how array_concat is currently implemented
rg -n "array_concat" --type rust -C 3 datafusion/ | head -80Repository: martin-augment/datafusion
Length of output: 6547
🏁 Script executed:
# Check for MakeArray implementation
rg -n "MakeArray" --type rust -C 3 datafusion/ | head -100Repository: martin-augment/datafusion
Length of output: 2933
🏁 Script executed:
# Check the ScalarFunction::new signature
rg -n "fn new\(" --type rust datafusion/expr/src/expr/ -A 2 | grep -A 2 "ScalarFunction"Repository: martin-augment/datafusion
Length of output: 134
🏁 Script executed:
# Verify if ScalarFunction::new still exists with BuiltinScalarFunction
rg -n "impl ScalarFunction" --type rust datafusion/expr/src/expr/ -A 10Repository: martin-augment/datafusion
Length of output: 134
🏁 Script executed:
# Check what the correct approach should be - look at array_concat and make_array helper functions
rg -n "pub fn array_concat" --type rust datafusion/functions-nested/src/ -A 5Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
# Check make_array helper function
rg -n "pub fn make_array" --type rust datafusion/functions-nested/src/ -A 5Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
# Find ScalarFunction definition
rg -n "struct ScalarFunction" --type rust -A 5 datafusion/Repository: martin-augment/datafusion
Length of output: 1965
🏁 Script executed:
# Find the public helper functions in functions-nested
find datafusion/functions-nested/src -name "*.rs" -type f | head -20Repository: martin-augment/datafusion
Length of output: 939
🏁 Script executed:
# Check expr_fn module for public helpers
rg -n "pub fn array_concat\|pub fn make_array" --type rust datafusion/functions-nested/Repository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
# Check what's in expr_fn.rs
cat datafusion/functions-nested/src/expr_fn.rs | head -100Repository: martin-augment/datafusion
Length of output: 143
🏁 Script executed:
# Find ScalarFunction impl and its methods
rg -n "impl ScalarFunction" --type rust -A 30 datafusion/expr/src/expr.rsRepository: martin-augment/datafusion
Length of output: 1287
🏁 Script executed:
# Check functions-nested lib.rs for public exports
cat datafusion/functions-nested/src/lib.rs | head -150Repository: martin-augment/datafusion
Length of output: 5356
🏁 Script executed:
# Check the macros to understand make_udf_expr_and_func
rg -n "make_udf_expr_and_func" --type rust -A 20 datafusion/functions-nested/src/macros.rs | head -80Repository: martin-augment/datafusion
Length of output: 1936
🏁 Script executed:
# Double-check: search for any usage of ScalarFunction::new with non-new_udf pattern
rg -n "ScalarFunction::new\(" --type rust datafusion/expr/src/expr.rsRepository: martin-augment/datafusion
Length of output: 51
🏁 Script executed:
# Verify the old BuiltinScalarFunction is still in codebase but unused in new patterns
rg -n "enum BuiltinScalarFunction" --type rust datafusion/ -A 5Repository: martin-augment/datafusion
Length of output: 51
This code will not compile—multiple critical API mismatches.
-
Line 91:
arg.get_type(&DataType::Null)has wrong signature.Expr::get_type()requires a&dyn ExprSchemaparameter, not a&DataType. -
Lines 92-95, 102-105:
ScalarFunction::new(BuiltinScalarFunction, args)does not exist. The only public constructor isScalarFunction::new_udf(udf: Arc<ScalarUDF>, args). -
Lines 87, 93:
BuiltinScalarFunction::ArrayConcatandBuiltinScalarFunction::MakeArrayare not the correct types to use. Both functions are implemented as properScalarUDFinstances indatafusion-functions-nested.
Fix: Use the public helper functions from datafusion_functions_nested::expr_fn:
- Replace the entire
to_array_concatimplementation with calls toarray_concat()andmake_array()instead of manually constructing expressions withBuiltinScalarFunctionand incorrect API calls.
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 85 to 106, the current
to_array_concat implementation uses incorrect APIs (Expr::get_type signature,
ScalarFunction::new constructor, and BuiltinScalarFunction variants) and won't
compile; replace the entire function body to use the public helpers from
datafusion_functions_nested::expr_fn: call make_array(expr) to wrap non-list
arguments and then call array_concat(vec_of_exprs) to build the concatenation
expression (import the two helper functions), removing any direct use of
BuiltinScalarFunction or ScalarFunction::new and any incorrect get_type calls.
| } | ||
|
|
||
| impl ScalarUDFImpl for ConcatFunc { |
There was a problem hiding this comment.
Missing closing brace for impl ConcatFunc block.
The impl ConcatFunc block starting at line 67 is not closed before the impl ScalarUDFImpl block begins at line 108. A closing } is needed after the to_array_concat function.
Ok(Expr::ScalarFunction(ScalarFunction::new(
array_concat,
args,
)))
}
+}
impl ScalarUDFImpl for ConcatFunc {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } | |
| impl ScalarUDFImpl for ConcatFunc { | |
| } | |
| } | |
| impl ScalarUDFImpl for ConcatFunc { |
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 106 to 108, the impl
ConcatFunc block opened earlier (around line 67) is missing its closing brace;
add a single closing curly brace `}` immediately after the to_array_concat
function to terminate the impl ConcatFunc block before the subsequent impl
ScalarUDFImpl begins so the file compiles and the impl blocks are properly
separated.
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| use DataType::*; | ||
|
|
||
| // If any argument is an array, return the array type | ||
| if arg_types.iter().any(|t| matches!(t, List(_))) { | ||
| // Find the first array type to use as the base | ||
| if let Some(DataType::List(field)) = arg_types.iter().find(|t| matches!(t, List(_))) { | ||
| return Ok(DataType::List(field.clone())); | ||
| } | ||
| } | ||
|
|
||
| // Otherwise, use the existing string type logic | ||
| let mut dt = &Utf8; | ||
| arg_types.iter().for_each(|data_type| { | ||
| for data_type in arg_types { | ||
| if data_type == &Utf8View { | ||
| dt = data_type; | ||
| } | ||
| if data_type == &LargeUtf8 && dt != &Utf8View { | ||
| dt = data_type; | ||
| } | ||
| }); | ||
|
|
||
| Ok(dt.to_owned()) | ||
| } | ||
| Ok(dt.clone()) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the concat.rs file more closely to understand the context
cd datafusion/functions/src/string && wc -l concat.rsRepository: martin-augment/datafusion
Length of output: 82
🏁 Script executed:
# Check the full concat function implementation to understand the context
cd datafusion/functions/src/string && head -160 concat.rs | tail -60Repository: martin-augment/datafusion
Length of output: 2014
🏁 Script executed:
# Search for tests related to array concatenation
find . -name "*.rs" -type f -exec grep -l "concat.*List\|array.*concat" {} \; | head -20Repository: martin-augment/datafusion
Length of output: 726
🏁 Script executed:
# Look for type coercion logic in DataFusion
rg "coerce.*type|type.*coerce" --type rs -A 3 -B 3 | head -100Repository: martin-augment/datafusion
Length of output: 95
🌐 Web query:
DataFusion concat array type coercion behavior
💡 Result:
Short answer: array_concat (and the array/list concat UDFs) require the input arrays to have element types that can be coerced to a single common element type; DataFusion will pick a common promoted/compatible Arrow DataType and (implicitly) cast elements as needed using DataFusion/Arrow cast rules. If types are not coercible the planner will error. Note also the SQL concat operator (string concat) is different — when given list/array arguments it may stringify them instead of performing array concatenation; use array_concat / list_concat for real array merge. [1][2][3][4]
Sources:
- array/list function docs (array_concat signature and description). [1]
- type-coercion / function-signature notes (arrays must be coercible to same inner type). [2]
- bug/discussion showing concat vs array_concat behavior (arrays being stringified). [3]
- Arrow/DataFusion cast behavior (how casts handle Utf8⇄numeric/boolean and list casts). [4]
🏁 Script executed:
# Check the proper array_concat implementation in functions-nested
cat datafusion/functions-nested/src/concat.rs | head -200Repository: martin-augment/datafusion
Length of output: 5941
🏁 Script executed:
# Look for type coercion logic in the planner
rg "coerce_types_for_function|type_coerce" --type rs -A 5 | head -150Repository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Search for tests related to array concatenation
rg "array_concat|concat.*List" --type rs tests/ | head -30Repository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Check if there are integration tests for concat with arrays
find . -name "*.rs" -path "*/tests/*" -exec grep -l "array_concat\|concat.*\[" {} \; | head -10Repository: martin-augment/datafusion
Length of output: 363
🏁 Script executed:
# Get the array_concat implementation in functions-nested
sed -n '200,400p' datafusion/functions-nested/src/concat.rsRepository: martin-augment/datafusion
Length of output: 5908
🏁 Script executed:
# Search for ArrayConcat struct and return_type implementation
rg "struct ArrayConcat|impl.*ArrayConcat" --type rs -A 30 datafusion/functions-nested/src/concat.rs | head -80Repository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Get the return_type method from array_concat
rg "fn return_type" --type rs -B 5 -A 20 datafusion/functions-nested/src/concat.rsRepository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Look at how type coercion is done for arrays
rg "ListCoercion|coerce_types_for_function" --type rs -B 3 -A 10 | head -100Repository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Get the full string concat implementation
cat datafusion/functions/src/string/concat.rs | head -250Repository: martin-augment/datafusion
Length of output: 9541
🏁 Script executed:
# Look at the to_array_concat method and how it's used
rg "to_array_concat|has_array_args" --type rs -B 5 -A 15 datafusion/functions/src/string/concat.rsRepository: martin-augment/datafusion
Length of output: 95
🏁 Script executed:
# Check if there's type coercion happening in the planner for this
rg "concat.*coerce|string.*concat.*type" --type rs datafusion/ | head -20Repository: martin-augment/datafusion
Length of output: 95
Type signature mismatch: return_type() doesn't match array_concat's actual type coercion behavior.
The return_type() method returns the first List field found without type unification. When delegating to array_concat() in invoke_with_args(), the actual return type may differ due to array_concat's proper type promotion (via type_union_resolution()). For example, concat(List<Int32>, List<Int64>) would declare return type List<Int32> but actually return List<Int64>, causing type signature mismatches. Use type_union_resolution() in return_type() to match array_concat's behavior, similar to the proper implementation in functions-nested.
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 121 to 143, the
return_type() currently picks the first List field found which can disagree with
array_concat()'s actual promoted return type; instead collect the element
DataType of every List(_) argument, call type_union_resolution() (the same
promotion used by array_concat) on those element types to produce a unified
field, and return DataType::List(unified_field); if resolution fails propagate
the error. Keep the existing non-list string promotion logic unchanged for the
Utf8/LargeUtf8 path.
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { | ||
| // Check if any argument is an array | ||
| if Self::has_array_args(&args.args) { | ||
| // Convert to array_concat expression and evaluate it | ||
| let exprs: Vec<Expr> = args.args.into_iter().map(|a| a.into_expr()).collect(); | ||
| let expr = Self::to_array_concat(exprs)?; | ||
| return expr.eval(args.context); | ||
| } | ||
|
|
||
| // Original string concatenation logic | ||
| let ScalarFunctionArgs { args, .. } = args; |
There was a problem hiding this comment.
Fundamental design issue: Cannot convert ColumnarValue to Expr for evaluation.
This approach has critical flaws:
-
Line 151:
ColumnarValue::into_expr()does not exist.ColumnarValuerepresents runtime columnar data (arrays/scalars), whileExprrepresents logical expressions. They cannot be directly converted. -
Line 153:
Expr::eval(args.context)is not a valid API. Expressions are evaluated through the execution engine, not directly. -
Conceptual issue:
invoke_with_argsreceives already-evaluated values. The correct approach is to call the array concatenation logic directly on theColumnarValuearguments, not attempt to rebuild and re-evaluate expressions.
Consider instead:
- Import and call the
array_concatkernel/function directly on theColumnarValuearguments - Or delegate to the existing
ArrayConcatUDF's invoke method
#!/bin/bash
# Find how ArrayConcat is implemented for direct invocation
rg -n "impl.*ScalarUDFImpl.*for.*ArrayConcat" --type rust -A 20
# Find the array_concat kernel or function
rg -n "fn array_concat" --type rust -C5🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 147 to 157, do not try
to turn ColumnarValue into Expr or call Expr::eval; instead detect array
arguments at runtime and call the array-concatenation implementation directly on
the ColumnarValue inputs (or delegate to the existing ArrayConcat UDF
implementation). Concretely: remove the into_expr() map and
to_array_concat()/eval flow, match the incoming ScalarFunctionArgs.args for
ColumnarValue::Array/Scalar variants, call the array_concat kernel or
ArrayConcat.invoke_with_args (importing the module if needed) with the same
context/args, and return its ColumnarValue result; ensure errors are propagated
and types are validated before calling the kernel.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! There is one more closing brace ( |
value:useful; category:bug; feedback:The Claude AI reviewer is correct! There is a missing closing curly bracket ( |
value:useful; category:bug; feedback:The Claude AI reviewer is correct! It seems this code is AI-generated and not reviewed by the PR author. The proposed changes are full with compilation errors. |
|
There is no point in reviewing this PR! It is AI-generated and not reviewed/polished by the PR author. |
19240: To review by AI